Categorical variables with many categories are preferentially selected in bootstrap-based model selection procedures for multivariable regression models.
Abstract
Automated variable selection procedures, such as backward elimination, are commonly employed to perform model selection in the context of multivariable regression. The stability of such procedures can be investigated using a bootstrap-based approach. The idea is to apply the variable selection procedure to a large number of bootstrap samples successively and to examine the resulting models, for instance, in terms of the inclusion of specific predictor variables. In this paper, we investigate a particularly important problem affecting this method in the case of categorical predictor variables with different numbers of categories, and we give recommendations on how to avoid it. For this purpose, we systematically assess the behavior of automated variable selection based on the likelihood ratio test, using either bootstrap samples drawn with replacement or subsamples drawn without replacement from the original dataset. Our study consists of extensive simulations and a real data example from the NHANES study. Our main result is that if automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables, even if none of them has any effect. Importantly, variables with no effect and many categories may be (wrongly) preferred to variables with an effect but few categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.
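The two resampling schemes compared in the abstract, and the inclusion frequencies derived from them, can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the `select` callable (a stand-in for a selection procedure such as backward elimination), and the subsample size of 0.632 n are all assumptions made for the example. The value 0.632 is chosen only because a bootstrap sample of size n contains on average about 63.2% distinct observations.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices WITH replacement (a bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def subsample_indices(n, m, rng):
    """Draw m distinct row indices WITHOUT replacement (a subsample)."""
    return rng.sample(range(n), m)


def inclusion_frequencies(data, select, draw, n_rep, rng):
    """Fraction of resamples in which each variable is selected.

    `select` is a hypothetical callable: it receives a list of rows and
    returns the names of the variables kept by the selection procedure.
    `draw` maps (n, rng) to a list of row indices.
    """
    counts = Counter()
    n = len(data)
    for _ in range(n_rep):
        resample = [data[i] for i in draw(n, rng)]
        counts.update(set(select(resample)))  # count each variable once per resample
    return {name: c / n_rep for name, c in counts.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    data = list(range(100))            # placeholder rows
    m = round(0.632 * len(data))       # assumed subsample size

    def select_stub(rows):
        # Placeholder: a real procedure would fit and prune a regression model.
        return ["x1"]

    freq_boot = inclusion_frequencies(data, select_stub, bootstrap_indices, 50, rng)
    freq_sub = inclusion_frequencies(
        data, select_stub, lambda n, r: subsample_indices(n, m, r), 50, rng
    )
    print(freq_boot, freq_sub)
```

Passing the resampling scheme as a callable keeps the selection procedure identical in both runs, so any difference in inclusion frequencies can be attributed to the resampling scheme alone.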
Related articles
Subsampling versus bootstrapping in resampling-based model selection for multivariable regression.
In recent years, increasing attention has been devoted to the problem of the stability of multivariable regression models, understood as the resistance of the model to small changes in the data on which it has been fitted. Resampling techniques, mainly based on the bootstrap, have been developed to address this issue. In particular, the approaches based on the idea of "inclusion frequency" cons...
Tree-Structured Modelling of Categorical Predictors in Regression
Generalized linear and additive models are very efficient regression tools but the selection of relevant terms becomes difficult if higher order interactions are needed. In contrast, tree-based methods also known as recursive partitioning are explicitly designed to model a specific form of interaction but with their focus on interaction tend to neglect the main effects. The method proposed here...
Determinants of Inflation in Selected Countries
This paper develops models to study the factors influencing the inflation rate for a panel of countries available in the World Bank database during 2008-2012. For this purpose, random effect log-linear and ordinal logistic models are used to analyze continuous and categorical inflation rate variables. As the original inflation rate response to variables shows an appar...
On properties of predictors derived with a two-step bootstrap model averaging approach - A simulation study in the linear regression model
In many applications of model selection there is a large number of explanatory variables and thus a large set of candidate models. Selecting one single model for further inference ignores model selection uncertainty. Often several models fit the data equally well. However, these models may differ in terms of the variables included and might lead to different predictions. To account for model se...
A survey of variable selection methods in two Chinese epidemiology journals
BACKGROUND Although much has been written on developing better procedures for variable selection, there is little research on how it is practiced in actual studies. This review surveys the variable selection methods reported in two high-ranking Chinese epidemiology journals. METHODS Articles published in 2004, 2006, and 2008 in the Chinese Journal of Epidemiology and the Chinese Journal of Pr...
Journal:
- Biometrical journal. Biometrische Zeitschrift
Volume 58, Issue 3
Pages: -
Publication date: 2016